library(tidyverse)
── Attaching packages ──────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Attaching package: ‘Hmisc’
The following objects are masked from ‘package:dplyr’:
src, summarize
The following objects are masked from ‘package:base’:
format.pval, units
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:Hmisc’:
subplot
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
appstore_df <- read.csv("../Data/AppleStore.csv")
appstore_df %>% arrange(X)
describe(appstore_df)
appstore_df
17 Variables 7197 Observations
-----------------------------------------------------------------------------
X
n missing distinct Info Mean Gmd .05 .10
7197 0 7197 1 4759 3550 405.8 823.6
.25 .50 .75 .90 .95
2090.0 4380.0 7223.0 9424.4 10166.4
lowest : 1 2 3 4 5, highest: 11081 11082 11087 11089 11097
-----------------------------------------------------------------------------
id
n missing distinct Info Mean Gmd .05
7197 0 7197 1 863130997 297160703 3.639e+08
.10 .25 .50 .75 .90 .95
4.227e+08 6.001e+08 9.781e+08 1.082e+09 1.131e+09 1.151e+09
lowest : 281656475 281796108 281940292 282614216 282935706
highest: 1187617475 1187682390 1187779532 1187838770 1188375727
-----------------------------------------------------------------------------
track_name
n missing distinct
7197 0 7195
lowest : _PRISM -The 穴通し3D- 君の記憶力x反射神経を問う! ~Mr.CURVEからの挑戦状 ~ :) Sudoku + ! OH Fantastic Free Kick + Kick Wall Challenge . Calculator .
highest: 鴨川等間隔の法則 麻雀物語3 役満乱舞の究極大戦 黄金日-贵金属理财投资黄金白银 龙之觉醒-热血经典RPG,回味激燃岁月 龙珠直播-高清游戏娱乐直播平台
-----------------------------------------------------------------------------
size_bytes
n missing distinct Info Mean Gmd .05
7197 0 7107 1 199134454 2.47e+08 12234138
.10 .25 .50 .75 .90 .95
20303053 46922752 97153024 181924864 414828134 824317542
lowest : 589824 618496 671744 698800 709632
highest: 3896109056 3956326400 3968637952 3975609344 4025969664
-----------------------------------------------------------------------------
currency
n missing distinct value
7197 0 1 USD
Value USD
Frequency 7197
Proportion 1
-----------------------------------------------------------------------------
price
n missing distinct Info Mean Gmd .05 .10
7197 0 36 0.818 1.726 2.632 0.00 0.00
.25 .50 .75 .90 .95
0.00 0.00 1.99 4.99 6.99
lowest : 0.00 0.99 1.99 2.99 3.99, highest: 59.99 74.99 99.99 249.99 299.99
-----------------------------------------------------------------------------
rating_count_tot
n missing distinct Info Mean Gmd .05 .10
7197 0 3185 0.998 12893 23717 0 0
.25 .50 .75 .90 .95
28 300 2793 18278 48107
lowest : 0 1 2 3 4
highest: 1126879 1724546 2130805 2161558 2974676
-----------------------------------------------------------------------------
rating_count_ver
n missing distinct Info Mean Gmd .05 .10
7197 0 1138 0.992 460.4 839.9 0.0 0.0
.25 .50 .75 .90 .95
1.0 23.0 140.0 591.4 1278.4
lowest : 0 1 2 3 4, highest: 88478 94315 107245 117470 177050
-----------------------------------------------------------------------------
user_rating
n missing distinct Info Mean Gmd .05 .10
7197 0 10 0.934 3.527 1.462 0.0 0.0
.25 .50 .75 .90 .95
3.5 4.0 4.5 4.5 5.0
Value 0.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
Frequency 929 44 56 106 196 383 702 1626 2663 492
Proportion 0.129 0.006 0.008 0.015 0.027 0.053 0.098 0.226 0.370 0.068
-----------------------------------------------------------------------------
user_rating_ver
n missing distinct Info Mean Gmd .05 .10
7197 0 10 0.955 3.254 1.866 0.0 0.0
.25 .50 .75 .90 .95
2.5 4.0 4.5 5.0 5.0
Value 0.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
Frequency 1443 125 74 136 176 304 533 1237 2205 964
Proportion 0.201 0.017 0.010 0.019 0.024 0.042 0.074 0.172 0.306 0.134
-----------------------------------------------------------------------------
ver
n missing distinct
7197 0 1590
lowest : 0.0.15 0.13 0.14.87 0.16.3 0.17.518
highest: v1.865 v2.13.9 v2.2.21 v3.6.9 V3.7.0
-----------------------------------------------------------------------------
cont_rating
n missing distinct
7197 0 4
Value 12+ 17+ 4+ 9+
Frequency 1155 622 4433 987
Proportion 0.160 0.086 0.616 0.137
-----------------------------------------------------------------------------
prime_genre
n missing distinct
7197 0 23
lowest : Book Business Catalogs Education Entertainment
highest: Social Networking Sports Travel Utilities Weather
-----------------------------------------------------------------------------
sup_devices.num
n missing distinct Info Mean Gmd .05 .10
7197 0 20 0.884 37.36 2.945 25.8 37.0
.25 .50 .75 .90 .95
37.0 37.0 38.0 40.0 43.0
Value 9 11 12 13 15 16 23 24 25 26 33
Frequency 1 3 1 7 2 8 1 270 67 42 2
Proportion 0.000 0.000 0.000 0.001 0.000 0.001 0.000 0.038 0.009 0.006 0.000
Value 35 36 37 38 39 40 43 45 47
Frequency 24 7 3263 1912 40 1142 371 8 26
Proportion 0.003 0.001 0.453 0.266 0.006 0.159 0.052 0.001 0.004
-----------------------------------------------------------------------------
ipadSc_urls.num
n missing distinct Info Mean Gmd
7197 0 6 0.747 3.707 1.876
Value 0 1 2 3 4 5
Frequency 1387 155 156 286 710 4503
Proportion 0.193 0.022 0.022 0.040 0.099 0.626
-----------------------------------------------------------------------------
lang.num
n missing distinct Info Mean Gmd .05 .10
7197 0 57 0.856 5.435 6.745 1 1
.25 .50 .75 .90 .95
1 1 8 15 21
lowest : 0 1 2 3 4, highest: 63 68 69 74 75
-----------------------------------------------------------------------------
vpp_lic
n missing distinct Info Sum Mean Gmd
7197 0 2 0.021 7147 0.9931 0.0138
-----------------------------------------------------------------------------
appstore_df <- mutate(appstore_df, size_mb = size_bytes/1000000)
appstore_df <- mutate(appstore_df, is_free = price == 0)
appstore_df <- mutate(appstore_df, user_rating_string = as.character(user_rating))
head(appstore_df)
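We first add a few derived variables that are easier to work with than the raw columns: the app size in megabytes (size_mb), whether the app is free or paid (is_free), and the user rating as a string (user_rating_string), which the later plots use as a discrete grouping variable.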
appstore_df %>% ggplot() + geom_boxplot(mapping = aes(x=prime_genre, y = rating_count_tot, color=prime_genre)) + scale_y_log10() + coord_flip() + labs(y="Rating Count (logarithmic)") + labs(x="App Primary Genre") + theme_minimal()
Since a total download count is not available in this dataset due to restrictions from Apple, we treat the total rating count of an app as a rough estimate of its total downloads: a higher rating count implies a more successful app. The graph above shows the popularity of each app genre on that measure. Genre has only a minor impact on the success of an app, with the Book genre as the exception.
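To put numbers behind the boxplot, the median rating count per genre can be tabulated directly. The following is a minimal base R sketch on a small made-up data frame (toy_df and its values are hypothetical, not taken from AppleStore.csv); on the real data, appstore_df would take its place.

```r
# Hypothetical toy data standing in for appstore_df (values are made up).
toy_df <- data.frame(
  prime_genre      = c("Games", "Games", "Games", "Book", "Book", "Weather"),
  rating_count_tot = c(1200, 300, 45000, 10, 0, 800)
)

# Median rating count per genre, sorted from most to least rated.
genre_medians <- sort(
  tapply(toy_df$rating_count_tot, toy_df$prime_genre, median),
  decreasing = TRUE
)
print(genre_medians)
```

The same summary can be written with dplyr as group_by(prime_genre) followed by summarise(). Because Hmisc is attached after the tidyverse here and masks summarize, spelling the call summarise() or dplyr::summarize() avoids picking up the Hmisc function by accident.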
appstore_df %>% ggplot() + geom_bar(mapping=aes(x=is_free, fill=is_free)) + labs(x="App is Free") + theme_minimal()
We can see that there are more free apps in the Apple App Store than paid apps. Prices of paid apps range from $0.99 to $299.99.
appstore_df %>% ggplot() + geom_col(mapping=aes(x=is_free, y=rating_count_tot, fill=is_free)) + labs(x="App is Free") + theme_minimal()
The total rating count of free apps, which we take as a proxy for total downloads, is much higher than that of paid apps. The ratio of free to paid rating counts is also considerably larger than the ratio of free to paid apps, which suggests that free apps are typically more successful in the App Store.
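The comparison in this paragraph is between two shares: the fraction of apps that are free, and the fraction of all ratings that free apps receive. A minimal sketch of that arithmetic in base R, again on a made-up data frame (toy_df is hypothetical; on the real data appstore_df would replace it):

```r
# Hypothetical toy data (values are made up for illustration).
toy_df <- data.frame(
  price            = c(0, 0, 0, 0.99, 4.99),
  rating_count_tot = c(5000, 12000, 300, 700, 2000)
)
toy_df$is_free <- toy_df$price == 0

# Share of apps that are free: 3 of 5 = 0.6
share_free_apps <- mean(toy_df$is_free)

# Share of all ratings that go to free apps: 17300 / 20000 = 0.865
share_free_ratings <- sum(toy_df$rating_count_tot[toy_df$is_free]) /
  sum(toy_df$rating_count_tot)

c(apps = share_free_apps, ratings = share_free_ratings)
```

If the second share is clearly larger than the first, free apps collect more ratings than their numbers alone would explain. Comparing the median rating count per group is a complementary check that is less sensitive to a handful of very popular apps.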
appstore_df %>% ggplot() + geom_histogram(mapping = aes(x=size_mb), bins = 60) + geom_vline(xintercept = median(appstore_df$size_mb), color="#873096") + geom_vline(xintercept = mean(appstore_df$size_mb), color="#ff6159") + theme_minimal()
appstore_df %>% ggplot() + geom_point(mapping = aes(x=user_rating_string, y=log10(rating_count_tot), color=prime_genre == "Games"), alpha=0.8) + theme_minimal()
The histogram above shows the distribution of app sizes across the 7,197 apps in the data set. The red line marks the mean app size in megabytes, and the purple line marks the median. Most apps are under the 500 MB threshold: according to the describe() output, the 90th percentile is about 415 MB.
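The gap between the two lines reflects a right-skewed distribution: a few very large apps pull the mean well above the median. A small base R sketch of the three quantities discussed here, using a made-up vector of sizes (size_mb below is hypothetical; on the real data it would be appstore_df$size_mb):

```r
# Hypothetical app sizes in megabytes (values are made up).
size_mb <- c(12, 47, 97, 182, 415, 824, 3900)

mean_mb   <- mean(size_mb)     # pulled upward by the 3900 MB outlier
median_mb <- median(size_mb)   # middle value, robust to the outlier

# Fraction of apps below the 500 MB threshold.
share_under_500 <- mean(size_mb < 500)

c(mean = mean_mb, median = median_mb, under_500 = share_under_500)
```

Taking the mean of a logical vector, as in the last step, is a compact way to get a proportion in R.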
appstore_df %>% ggplot() + geom_point(mapping = aes(y=size_mb, x=log10(rating_count_tot), color=prime_genre == "Games"), alpha=0.8) + theme_minimal()
plot_ly(
data=appstore_df,
x = ~log10(rating_count_tot),
y = ~size_mb,
color=~prime_genre
)
No trace type specified:
Based on info supplied, a 'scatter' trace seems appropriate.
Read more about this trace type -> https://plot.ly/r/reference/#scatter
No scatter mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
This scatterplot brings out several aspects of the data we're investigating. The most striking is that most of the apps in the dataset fall into the Games category, as the blue coloring shows. Most of the apps with a download size above 1 GB are also games, while apps in every other category mostly stay under 1 GB. In addition, the super-popular apps at the far right of the graph typically do not exceed 2 GB.
appstore_df %>% ggplot() + geom_boxplot(mapping = aes(x=cont_rating, y = rating_count_tot, color=cont_rating)) + scale_y_log10() + coord_flip() + labs(y="Rating Count (logarithmic)") + labs(x="Content Rating") + theme_minimal()
We can see that content rating plays a very minor role in determining the success of an app.
appstore_df %>% ggplot() + geom_boxplot(mapping = aes(x=user_rating_string, y = rating_count_tot, color=user_rating_string)) + scale_y_log10() + coord_flip() + labs(y="Rating Count (logarithmic)") + labs(x="User Review Rating") + theme_minimal()
From the boxplot above, we can see a clear growth in total rating count (which gives a rough indication of total downloads) as the user rating rises to about 1.5 stars. From 2 to 5 stars, however, no clear difference in rating count can be read off the graph.
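One caveat applies to all of the log-scaled plots above. The describe() output shows that at least 10% of apps have a total rating count of 0, and log10(0) is -Inf, so those apps have no finite position on a logarithmic axis and the log-scale boxplots are computed without them. A common workaround, sketched here in base R on a made-up vector (rating_count is hypothetical), is to shift the counts by one before taking the logarithm:

```r
# Hypothetical rating counts, including apps with no ratings at all.
rating_count <- c(0, 0, 28, 300, 2793)

# Plain log10 turns the zero counts into -Inf.
plain_log <- log10(rating_count)
n_dropped <- sum(!is.finite(plain_log))

# Shifting by one keeps every app on the axis; a count of 0 maps to 0.
shifted_log <- log10(rating_count + 1)

n_dropped
shifted_log
```

Whether the shift is appropriate depends on the question: it keeps the unrated apps visible, at the cost of slightly distorting the low end of the scale.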